Using Anchoring Vignettes to Calibrate Teachers’ Self-Assessment of Teaching
Background 

High-quality measures of instructional practice are essential for research and evaluation of 
innovative instructional policies and programs, as well as for providing feedback to teachers and 
administrators. Classroom observations are generally considered the “gold standard” for 
gathering rich information about what teachers do. However, observation protocols can be
time-consuming and costly to develop, validate, and implement (Pianta, Belsky, Vandergrift, Houts, &
Morrison, 2008; Bill & Melinda Gates Foundation, 2012). 

Teachers’ self-reports through surveys and classroom logs are much more time-efficient and 
comparatively easy to administer over a large population. However, self-reported results suffer 
from potential biases due to differences in respondents’ understanding of the latent constructs 
being measured, as well as interpretation and application of the scale used to quantify their 
practices (Hill, 2005; Chevalier & Fielding, 2011). For example, teachers might have different
understandings of what constitutes a cognitively demanding learning task when responding to a
survey item about the extent to which their students are engaged in such tasks. Furthermore, a high
level of an instructional practice as perceived by one teacher may be a lower level as perceived 
by another. These differences lead to incomparability between teachers’ answers and contribute 
to the lack of validity of teachers’ self-reported teaching measures observed in prior research 
(Mayer, 1999; Hill, 2005; Stecher, Le, Hamilton, Ryan, Robyn, & Lockwood, 2006). 

The use of anchoring vignettes in surveys has the potential to diagnose and, in some cases, 
address the factors that likely lead to inaccuracy in teacher self-reports about their instruction 
(King, et al., 2004). When using this method, researchers provide an operational definition of an
abstract construct to be measured through a hypothetical scenario with detailed descriptions of 
the cognition and behaviors of individuals similar to the respondents. Researchers change the 
cognition or behaviors of individuals in the scenario and generate different versions of vignettes 
to represent different levels of the underlying construct. A group of subject-matter experts rates each
vignette using the same scale that respondents will use to rate these vignettes themselves and
identifies the point on the scale that corresponds to each vignette.

In a typical survey that uses the anchoring vignettes method, respondents are presented with all 
anchoring vignettes and asked to rate the individual in each vignette as well as themselves on the 
latent construct of interest. Respondents’ ratings on anchoring vignettes are used to calibrate 
their self-assessments to a common scale so that the adjusted self-assessment results are more 
comparable across respondents than the raw self-assessment results (Wand, King, & Lau, 2011). 
This method has been successfully implemented in political science, health, and other fields and 
found to be useful to calibrate survey self-reports and provide more comparable results across 
respondents (King, et al., 2004; Grol-Prokopczyk, Freese, & Hauer, 2011; Soest, et al., 2011). 

Purpose of Study 

In this study, we examined whether using anchoring vignettes in web-based surveys improved
the validity of teachers’ self-assessments of their mathematics instruction. To investigate
validity, we compared the correlations between teachers’ self-ratings and other measures of
teaching (teachers’ value-added scores, student surveys, and observation ratings of instruction)
before and after calibration.

Significance of Study

This is the first study that uses anchoring vignettes to calibrate teacher self-assessment of 
teaching. In this study, we will provide a rigorous test of whether response scale differences are 
partly responsible for the current lack of alignment between teachers’ survey self-reports about 
their instruction and observations of those teachers’ instructional practices. If successful, this 
innovative method could contribute to the development of a new generation of survey 
instrumentation for assessing mathematics teachers’ instruction. Such advancements in survey 
methods could greatly inform research on the antecedents and effects of instructional practices, 
as well as provide rapid feedback to school leaders and teachers themselves about their work. 

Participants 

Data came from 61 mathematics teachers in grades 4-9 participating in the Bill & Melinda Gates 
Foundation’s Measures of Effective Teaching Extension project. The sample was roughly evenly 
distributed between elementary (grades 4-5) and middle school (grades 6-8) teachers, with only 
three percent teaching 9th grade. Eighty percent of these teachers were female. About two-thirds
were white, one quarter were black, and the remaining teachers were Hispanic. On average, they 
had about six years of teaching experience in their current districts, and over 40 percent had at 
least a master’s degree. 

Data Collection and Analysis 

We worked with mathematics education experts to identify six dimensions of mathematics 
instruction on which to focus in our survey: (1) emphasis on mathematical vocabulary; (2) 
questioning; (3) emphasis on student effort; (4) use of instructional time; (5) use of cognitively 
challenging tasks; and (6) remediation. For each dimension, we designed a series of four 
anchoring vignettes representing four different levels of practice for that dimension. 

Teachers completed an online survey within one to two days after video-recording a mathematics 
lesson. In the survey, teachers rated themselves on these six dimensions for their videotaped 
lesson. For each dimension, teachers also rated four short anchoring vignettes representing 
hypothetical classrooms where differing levels of the dimension were present. Finally, they were 
asked to rank the vignettes, along with their own practices, according to the definitions of each 
practice provided. 

We administered the survey in two waves from January to June 2013, with questions for three 
randomly-chosen dimensions included in each wave. In the first wave, self-ratings came before 
the vignette ratings for each dimension; in the second wave, self-ratings came after vignette 
ratings. Trained raters (two raters per video) scored the videotaped lessons using a rubric that 
captures the same dimensions as the survey, and they worked together to reconcile any 
disagreements in their ratings. 

We also have a composite teacher performance measure for each teacher, drawn from the 
original Measures of Effective Teaching project (Bill & Melinda Gates Foundation, 2012). The 
composite measure is a combined score based on an array of measures, including teachers’ 
value-added scores, results from student surveys, and classroom observations.

Statistical Model

We used both non-parametric and parametric methods to calibrate teachers’ self-ratings. The 
non-parametric calibration method recodes teachers’ categorical self-ratings relative to each set 
of anchoring vignettes. Let $y_i$ be the categorical survey self-assessment for teacher $i$ ($i = 1, \ldots, N$)
and $z_{ij}$ be the categorical survey response of respondent $i$ on vignette $j$ ($j = 1, \ldots, J$). For
respondents who ranked all four vignettes on each dimension in the same order as the panel of
experts ($z_{i,j-1} < z_{ij}$ for all $i, j$), the calibrated self-rating is

$$
C_i = \begin{cases}
1 & \text{if } y_i < z_{i1} \\
2 & \text{if } y_i = z_{i1} \\
3 & \text{if } z_{i1} < y_i < z_{i2} \\
\vdots & \\
2J + 1 & \text{if } y_i > z_{iJ}
\end{cases} \tag{1}
$$

Inconsistencies in the ordinal ranking of vignettes are grouped and treated as ties. Respondents 
with ties in the vignette ratings would receive an interval value for C instead of a scalar value. 
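
The recoding rule in equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation in the anchors package: the function name calibrate_self_rating is hypothetical, ratings are assumed to be integers, and only respondents whose vignette ratings are strictly increasing in the experts' order are handled (ties and order violations, which produce an interval, are rejected).

```python
from typing import Sequence


def calibrate_self_rating(y: int, z: Sequence[int]) -> int:
    """Return the calibrated rating C in 1..2J+1 for one respondent.

    y: the respondent's categorical self-rating.
    z: the respondent's ratings of the J vignettes, listed in the experts'
       order from the lowest to the highest level of practice.
    Assumes z is strictly increasing; other cases yield an interval
    rather than a scalar and are not handled in this sketch.
    """
    if any(a >= b for a, b in zip(z, z[1:])):
        raise ValueError("vignette ratings must be strictly increasing")
    for j, z_j in enumerate(z, start=1):
        if y < z_j:
            return 2 * j - 1  # strictly below vignette j (above vignette j-1)
        if y == z_j:
            return 2 * j      # tied with vignette j
    return 2 * len(z) + 1     # strictly above the highest vignette
```

Under these assumptions, two teachers who both give themselves a raw rating of 3 can end up at different calibrated positions: with vignette ratings [1, 2, 4, 5] the self-rating falls between the second and third vignettes, whereas with [2, 3, 4, 5] it ties with the second vignette.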

The parametric method models the construct with random measurement error and allows the 
thresholds that turn the unobserved perceived variable into an observed categorical response to 
vary over individuals as a function of measured explanatory variables. 

Let $\mu_i$ represent respondent $i$'s actual level on the underlying construct to be measured. Assume
$\mu_i$ is on a continuous, unbounded, and unidimensional scale, with higher values indicating higher
levels of the construct of interest. This actual level varies over respondents as a linear function of
observed covariates $X_i$ with coefficient $\beta$ and an independent normal random effect $\eta_i$:

$$\mu_i = X_i \beta + \eta_i, \qquad \eta_i \sim N(0, \omega^2) \tag{2}$$

The parametric method assumes respondents perceive their actual levels on the construct of
interest subject only to random error. Let $Y_i^*$ represent respondent $i$'s unobserved continuous
perceived level on the self-rating question:

$$Y_i^* \sim N(\mu_i, \sigma^2) \tag{3}$$

Respondent $i$ answers the self-rating question, which has $K$ ordinal response categories. S/he turns the
unobserved perceived level $Y_i^*$ into the reported category $y_i$ via this observation mechanism:

$$y_i = k \quad \text{if } \tau_i^{k-1} < Y_i^* < \tau_i^{k} \tag{4}$$

The vector of thresholds $\tau_i$ (where $\tau_i^0 = -\infty$, $\tau_i^K = +\infty$, and $\tau_i^{k-1} < \tau_i^k$ for $k = 1, \ldots, K$) varies over
respondents as a function of covariates $V_i$ and unknown parameter vectors $\gamma$:

$$\tau_i^1 = \gamma^1 V_i$$
$$\tau_i^k = \tau_i^{k-1} + e^{\gamma^k V_i}, \qquad k = 2, \ldots, K - 1 \tag{5}$$

Let $\theta_j$ represent the hypothetical person’s actual level on the latent construct in vignette $j$ ($j = 1,
\ldots, J$). $Z_{ij}^*$ represents respondent $i$’s unobserved perceived level of the hypothetical person in
vignette $j$. Respondent $i$ perceives $\theta_j$ with random normal error:

$$Z_{ij}^* \sim N(\theta_j, \sigma^2) \tag{6}$$

Similar to the self-rating process, respondent $i$ turns the unobserved perceived level $Z_{ij}^*$ into
the reported category $z_{ij}$ via a similar observation mechanism:

$$z_{ij} = k \quad \text{if } \tau_i^{k-1} < Z_{ij}^* < \tau_i^{k} \tag{7}$$

The thresholds are determined by the same $\gamma$ coefficients used for $y_i$, and by the same explanatory
variables with values measured for respondent $i$:

$$\tau_i^1 = \gamma^1 V_i$$
$$\tau_i^k = \tau_i^{k-1} + e^{\gamma^k V_i}, \qquad k = 2, \ldots, K - 1 \tag{8}$$
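
To make the data-generating process in equations (2) through (8) concrete, the following Python sketch simulates one respondent's self-rating and vignette ratings. It is an illustration under stated assumptions rather than the estimation routine of the anchors package: the function names and all parameter values are hypothetical, and the threshold increments are assumed to be exponentiated, as in the King et al. (2004) formulation, so that thresholds stay ordered.

```python
import numpy as np


def thresholds(gamma, v):
    """Interior thresholds tau^1..tau^{K-1} for one respondent.

    gamma has shape (K-1, P) and v has shape (P,). The first threshold is
    gamma[0] @ v; each later one adds exp(gamma[k] @ v) > 0 (assumed form),
    so the thresholds are strictly increasing.
    """
    steps = np.asarray(gamma, dtype=float) @ np.asarray(v, dtype=float)
    steps[1:] = np.exp(steps[1:])
    return np.cumsum(steps)


def to_category(latent, tau):
    """Map a latent value to category k in 1..K with tau^{k-1} < latent < tau^k."""
    return int(np.searchsorted(tau, latent)) + 1


def simulate_respondent(rng, x, v, beta, gamma, theta, omega=1.0, sigma=1.0):
    """Draw one self-rating y and one rating per vignette z for a respondent."""
    mu = x @ beta + rng.normal(0.0, omega)        # actual level, eq. (2)
    tau = thresholds(gamma, v)                    # shared thresholds, eqs. (5), (8)
    y = to_category(rng.normal(mu, sigma), tau)   # perceived self level, eqs. (3)-(4)
    z = [to_category(rng.normal(t, sigma), tau)   # perceived vignette levels,
         for t in theta]                          # eqs. (6)-(7)
    return y, z
```

The key design point is that the same threshold vector is applied to the self-rating and to every vignette rating. Because the vignette levels are common to all respondents, differences in how respondents rate the vignettes identify differences in their thresholds, which is what allows the self-ratings to be placed on a common scale.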

We implemented the non-parametric and parametric models using the R package anchors (Wand,
King, & Lau, 2011). Covariates used in $X_i$ and $V_i$ include teachers’ gender, ethnicity, years of
teaching experience, and whether the teacher had a master’s degree. We then examined the
correlation between teachers’ self-ratings and the composite teacher performance measure before
and after the calibration.
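
The before-and-after comparison can be sketched as follows. This is a minimal illustration, assuming a Pearson correlation (the type of correlation is not specified here) and hypothetical function names; it is not the analysis code used in the study.

```python
import numpy as np


def pearson_r(a, b) -> float:
    """Pearson correlation between two equal-length numeric sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])


def correlation_before_after(raw, calibrated, composite):
    """Return (r_raw, r_calibrated), each computed against the composite measure.

    raw and calibrated hold each teacher's self-rating before and after
    calibration; composite holds the same teachers' composite scores.
    """
    return pearson_r(raw, composite), pearson_r(calibrated, composite)
```

Calibration would count as an improvement on this criterion if the second value returned is larger than the first for the same set of teachers.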

We also investigated whether teachers rank the vignettes the way we intended; which dimensions 
teachers and observers rate reliably; and how survey and observation ratings compare. 

Findings / Results: 

Preliminary findings suggest that anchoring vignettes represent a promising innovation for 
measuring teachers’ instruction through survey self-reports. Specifically, we found: 

• Teachers’ survey responses that are calibrated through the use of anchoring vignettes 
have increased variation compared to teachers’ raw survey responses, particularly for the 
cognitive challenge dimension; 

• Teachers’ calibrated survey responses regarding mathematical vocabulary and
cognitively challenging tasks are more strongly correlated with the composite measure of
teacher performance than are raw survey responses;

• If teachers gave their self-rating after rating the vignettes, rather than before, the entire
collection of calibrated self-ratings is significantly correlated with the composite
performance measure (p < .05).

Conclusions: 

These findings suggest that anchoring vignettes improve the accuracy of teachers’ self-reports, 
which has implications for how researchers and practitioners can efficiently gather and learn 
from instructional data. 

Limitations of this study include restricted sample size and lack of evidence regarding validity of 
the measure for specific purposes such as use in a high-stakes evaluation system or use as a 
source of information to inform professional development. 

Future research may refine the practices examined in this study; incorporate practices that we
have not studied as carefully, such as classroom climate; identify the specific teacher-student
interactions that characterize each level of a given dimension; and study how to use anchoring
vignettes or vignette-calibrated survey results for different purposes, such as professional
development or making high-stakes decisions.